NSF PAR Search | NSF Public Access Repository

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

Multi-Target Embodied Question Answering

Yu, Licheng; Chen, Xinlei; Gkioxari, Georgia; Bansal, Mohit; Berg, Tamara L; Batra, Dhruv (January 2019, IEEE Conference on Computer Vision and Pattern Recognition)

Embodied Question Answering (EQA) is a relatively new task where an agent is asked to answer questions about its environment from egocentric perception. EQA as introduced in [8] makes the fundamental assumption that every question, e.g. “what color is the car?”, has exactly one target (“car”) being inquired about. This assumption puts a direct limitation on the abilities of the agent. We present a generalization of EQA – Multi-Target EQA (MT-EQA). Specifically, we study questions that have multiple targets in them, such as “Is the dresser in the bedroom bigger than the oven in the kitchen?”, where the agent has to navigate to multiple locations (“dresser in bedroom”, “oven in kitchen”) and perform comparative reasoning (“dresser” bigger than “oven”) before it can answer a question. Such questions require the development of entirely new modules or components in the agent. To address this, we propose a modular architecture composed of a program generator, a controller, a navigator, and a VQA module. The program generator converts the given question into sequential executable sub-programs; the navigator guides the agent to multiple locations pertinent to the navigation-related sub-programs; and the controller learns to select relevant observations along its path. These observations are then fed to the VQA module to predict the answer. We perform detailed analysis for each of the model components and show that our joint model can outperform previous methods and strong baselines by a significant margin. Project page: https://embodiedqa.org.
more » « less
Full Text Available
Visual to Sound: Generating Natural Sound for Videos in the Wild

Zhou, Yipin; Wang, Zhaowen; Fang, Chen; Bui, Trung; Berg, Tamara L. (June 2018, IEEE Conference on Computer Vision and Pattern Recognition)

As two of the five traditional human senses (sight, hearing, taste, smell, and touch), vision and sound are basic sources through which humans understand the world. Often correlated during natural events, these two modalities combine to jointly affect human perception. In this paper, we pose the task of generating sound given visual input. Such capabilities could help enable applications in virtual reality (generating sound for virtual scenes automatically) or provide additional accessibility to images or videos for people with visual impairments. As a first step in this direction, we apply learning-based methods to generate raw waveform samples given input video frames. We evaluate our models on a dataset of videos containing a variety of sounds (such as ambient sounds and sounds from people/animals). Our experiments show that the generated sounds are fairly realistic and have good temporal synchronization with the visual inputs.
more » « less
Full Text Available
TVQA: Localized, Compositional Video Question Answering

Lei, Jie; Yu, Licheng; Bansal, Mohit; Berg, Tamara L. (January 2018, Empirical Methods in Natural Language Processing)

Recent years have witnessed an increasing interest in image-based question-answering (QA) tasks. However, due to data limitations, there has been much less work on video-based QA. In this paper, we present TVQA, a largescale video QA dataset based on 6 popular TV shows. TVQA consists of 152,545 QA pairs from 21,793 clips, spanning over 460 hours of video. Questions are designed to be compositional in nature, requiring systems to jointly localize relevant moments within a clip, comprehend subtitle-based dialogue, and recognize relevant visual concepts. We provide analyses of this new dataset as well as several baselines and a multi-stream end-to-end trainable neural network framework for the TVQA task. The dataset is publicly available at http://tvqa.cs.unc.edu.
more » « less
Full Text Available
MAttNet: Modular Attention Network for Referring Expression Comprehension

Yu, Licheng; Lin, Zhe; Shen, Xiaohui; Yang, Jimei; Lu, Xin; Bansal, Mohit; Berg, Tamara L. (June 2018, IEEE Conference on Computer Vision and Pattern Recognition)

In this paper, we address referring expression comprehension: localizing an image region described by a natural language expression. While most recent work treats expressions as a single unit, we propose to decompose them into three modular components related to subject appearance, location, and relationship to other objects. This allows us to flexibly adapt to expressions containing different types of information in an end-to-end framework. In our model, which we call the Modular Attention Network (MAttNet), two types of attention are utilized: language-based attention that learns the module weights as well as the word/phrase attention that each module should focus on; and visual attention that allows the subject and relationship modules to focus on relevant image components. Module weights combine scores from all three modules dynamically to output an overall score. Experiments show that MAttNet outperforms previous state-of-the-art methods by a large margin on both bounding-box-level and pixel-level comprehension tasks. Demo and code are provided.
more » « less
Full Text Available
Combining Multiple Cues for Visual Madlibs Question Answering

https://doi.org/10.1007/s11263-018-1096-0

Tommasi, Tatiana; Mallya, Arun; Plummer, Bryan; Lazebnik, Svetlana; Berg, Alexander C.; Berg, Tamara L. (April 2018, International Journal of Computer Vision)

This paper presents an approach for answering ﬁll-in-the-blank multiple choice questions from the Visual Madlibs dataset.Instead of generic and commonly used representations trained on the ImageNet classiﬁcation task, our approach employs acombination of networks trained for specialized tasks such as scene recognition, person activity classiﬁcation, and attributeprediction. We also present a method for localizing phrases from candidate answers in order to provide spatial support forfeature extraction. We map each of these features, together with candidate answers, to a joint embedding space throughnormalized canonical correlation analysis (nCCA). Finally, we solve an optimization problem to learn to combine scoresfrom nCCA models trained on multiple cues to select the best answer. Extensive experimental results show a signiﬁcantimprovement over the previous state of the art and conﬁrm that answering questions from a wide range of types beneﬁts fromexamining a variety of image cues and carefully choosing the spatial support for feature extraction.
more » « less
Full Text Available

Search for: All records